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INTRODUCTION 

The  theme  of  this  lecture  is  that  the  availability  of 
very  high  density  integrated  circuits  will  be  changing  our 
approach  and  emphasis  to  several  problems  in  estima¬ 
tion  and  control. 

For  example,  the  minimality  of  realizations  will  be 
less  significant  than  their  modularity,  local  interconnect¬ 
edness.  area  time  complexity  measures,  etc.  Similarly, 
good  algorithms  for  serial  processing  may  be  poor  candi¬ 
dates  for  parallel  implementation. While  it  is  hard  for  me 
in  mid-August  to  predict  exactly  what  I  shall  say  in  the 
lecture  in  mid-December.  1  think  it  might  be  useful  to 
provide  in  written  form  some  of  the  background  material 
on  which  a  good  part  of  my  talk  will  be  based.  Thus,  at 
this  meeting  at  least.  I  plan  to  illustrate  the  above  points 
by  several  examples,  including: 

1)  description  of  a  parallel  architecture  for  the  meas¬ 
urement  update  step  (in  triangular  array  form)  of 
the  Kalman  filter. 

2)  development  of  the  Schur  algorithm  as  a  better  can¬ 
didate  than  the  Levinson  algorithm  for  VLSI  imple¬ 
mentation  of  Toeplitz  equation  solvers, 

3)  comparison  of  the  Berlekamp-Massey-Rissanen  and 
Lanczos  algorithms  in  the  problems  of  partial  reali¬ 
zation  and  of  the  decoding  of  BCH  codes, 

4)  development  of  minimal,  but  pipelined  and 
orthogonally-cascaded,  implementations  of  time- 
invariant.  finite-dimensional  (ARMA)  systems. 

On  the  other  hand,  while  much  of  the  development 
and  demonstration  of  new  parallel  computing  structures, 
such  as  systolic  arrays  and  dataflow  machines,  has  been 
carried  out  by  computer  scientists,  I  hope  to  show  (by 
examples)  that  system  theori  ts  can  contribute  to  the 
understanding  and  analysis  of  such  structures  and  help 
develop  more  efficient  ones. 

1.  A  PARALLEL  ARCHITECTURE  FOR  MEASUREMENT 
UPDATES 

This  part  of  the  talk  will  be  based  upon  a  paper  "A 
Parallel  trchitecture  for  Kalman  Filter  Measurement 
Update,"  by  J.  lover  and  T.  Kailath,  June  1993. 

The  main  theme,  in  this  example  and  some  of  the 
later  ones,  is  to  show  how  reformulation  of  an  algorithm 

^  This  talk  will  be  band  in  part  on  worii  tupportad  in  recent  years  at 
Stanford  by  the  Air  Force  Office  of  Scientific  Reaaarcn,  the  Arrr.y 
Raa* arch  Office,  the  Joint  Serv.cea  Electronic*  Program,  tna  Defense 
Advanced  Research  Budget.*  Agency,  and  th*  National  Science  Founda¬ 
tion. 


can  lead  to  more  efficient  implementation. 

In  this  paper  we  present  a  parallel  computing  struc¬ 
ture  of  the  systolic  array  type  for  implementing  a  new 
algorithm  for  the  measurement  update  step  of  the  Kal¬ 
man  filter  for  state-space  estimation.  With  a  single  serial 
processor,  the  update  of  a  scalar  measurement  would 
take  time  0(n2).  where  n  is  the  state  dimension:  we 
present  an  array  with  2n+4  elementary  processors  and  a 
bank  of  delay  units,  that  will  carry  out  the  measurement 
update  in  time  O(n).  More  savings  can  be  realized  by  an 
extended  architecture  that  will  update  a  p  -dimensional 
measurement  with  O(np)  processors  in  time 
0(max\n.  p\),  instead  of  time  0(maz\nzp .  npz,  p3 J)  for 
the  single-processor  implementation. 

The  only  earlier  work  that  we  are  aware  of  in  this 
direction  is  that  of  Andrews  (1991)  who  developed  a 
parallel  structure  with  0(nz)  elementary  processors  for 
implementing  the  so-called  U-D  algorithm  tor  the  meas¬ 
urement  update  by  rearranging  the  order  of  computa¬ 
tion  of  certain  equations  given  by  Bierman  (1975), 
(1977).  Our  structure  allows  us  to  update  not  only  the 
covariance  factors  but  also  the  state  estimates  them¬ 
selves.  and  also  has  other  advantages  over  Andrews 
scheme. 

The  new  structure  was  in  fact  suggested  by  a 
different  way  (Kailath  (1992))  of  carrying  out  the  U-D 
measurement  update--using  triangularization  of  an 
(tl+p)x(n+p)  matrix  via  Modified  Givens  rotations  (as 
given  by  Gentleman  (1973)).  It  can  be  shown  that  for  a 
scalar  measurement  and  a  serial  processor.  Bierman's 
equations  are  equivalent  to  our  triangularization 
method.  However,  the  fact  that  there  are  no  explicit 
equations  in  our  scheme  seems  to  make  it  easier  to  con¬ 
ceive  of  a  parallel  computing  structure.  The  architec¬ 
ture  we  propose  is  of  the  wavefront  or  systolic-type  (see. 
e.g..  Mead  and  Conway  (19B0),  M.  T.  Kung  (1982),  S.  Y. 
Kung  (1992)). 

Moreover,  it  is  significant  that  our  algorithm  is 
essentially  the  same  whether  we  have  scalar  or  vector 
measurements,  while  to  our  knowledge  it  is  difficult  to 
extend  Bierman's  equations  to  the  vector  case.  The 
main  reason  is  that  Bierman's  derivation  is  based  on  a 
formula  of  Agee  and  Turner  (1972)  for  updating  the  l.DU 
factors  of  a  rank-one  perturbation  of  a  given  matrix;  with 
a  p-dimensional  measurement,  we  have  a  rank  p  update 
and  there  is  no  simple  extension  of  the  Agee-Turner  for¬ 
mula  to  this  case.  The  usual  way  around  this  is  to  pro¬ 
cess  the  measurements  sequentially,  after  some  prelim¬ 
inary  data  transformation,  but  a  direct  parallel  imple¬ 
mentation  will  then  take  time  0(np )  as  opposed  to  the 
pj)  time  for  our  structure. 

We  might  mention  that  the  ideas  and  schemes  of 
this  paper  can  also  be  used  for  other  forms  of  the 
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measurement  update,  e.g.  using  strictly  triangular  fac¬ 
tors  (with  scalar  arithmetic  square  roots)  rather  than 
the  LDU  forms.  Parallel  architectures  for  time-update 
calculations  can  also  be  obtained,  though  apparently 
with  0(nz)  processors  rather  than  0(n);  since  this  ques¬ 
tion  is  still  under  study,  we  restrict  ourselves  in  this 
paper  to  the  measurement  update  problem. 

1.1  MEASUREMENT  UPDATE  AND  BLKRMAN'S  U-D  EQUA¬ 
TIONS 

We  use  the  by  now  almost  standard  notation  intro¬ 
duced  by  Kalman  (1960),  in  terms  of  which  we  can  state 

the 

MEASUREMENT  UPDATE  EQUATIONS. 


=  +  ^/.i(v»  ~  Hitt)  .  i>  0  (la) 

r0:=0  (lb) 

K/.i  (2) 

R.  i  -  +  Pt  (3) 

Pi i,  ~Pi~  PiHirP.:lHiPi  (4a) 

Po-=no  (<b) 

TIME  UPDATE  EQUATIONS: 

*.m  *  .  i  *  0  (5) 

/»*♦!  *  FiPinFf+  WiG?  (8) 


Several  alternatives  to  Eqns.  (l)-(6)  have  been  sug¬ 
gested.  based  on  the  idea  (first  used  by  Potter  (1963))  of 
propagating  square-root  factors  of  Px  and  Pllt.  There 
are  several  forms  of  such  algorithms  (see  Kaminski.  Bry¬ 
son.  and  Schmidt  (1971);  Morf  and  Kailath  (1975);  Bier- 
man  (1977)).  In  particular  Bierman  modified  a  measure¬ 
ment  update  algorithm  derived  by  Carlson  (1973)  so  that 
it  did  not  use  any  arithmetic  square  roots— a  feature  that 
may  be  important  in  certain  implementations.  This  algo¬ 
rithm  is  based  on  working  with  LDU  (lower-triangular, 
diagonal,  upper-triangular)  factorizations  of  P,  and 
Pm.  Since  Px  and  Pm  ®re  symmetric.  U  =  Ll Bier- 
man  uses  the  UDUT  form  and  calls  his  version  of  the 
algorithm /or  computing  the  }t/.  £>|  factors  of  Pm  from 
the  \  U.  D |  factors  of  P^lhe  U-D  algorithm. 

Herman's  Equations 

Let 

Pt  =  LDLt  .  Pm  =  L.D.Ll 

Substituting  these  expressions  into  the  measure¬ 
ment  update  Eqn,  (4)  gives 

LtD,LT,  =  L  (  D  -  DLTHiTP.:}H(LD  )  LT  =  LDlr  .  say 

Now  if  we  have  only  a  single  measurement,  i.e.  p  =  1, 
then  Ht  will  be  a  row  matrix,  say 

Hi  —  Ki  ,  P9ii  —  r,  (  ,  Pi  —  r( 

and  if  we  now  factor  D  as 

D  =  LDtT  =  D  -  DLThir;'ih?LD 

then  clearly 

L*  =  Lt  .  Dt  =  D 

Note  that  D  is  a  rank-one  modification  of  D:  in  1972 
Agee  and  Turner  showed  how  to  compute  the  )£,  D\  fac¬ 
tors  of  such  a  matrix.  Incorporating  a  numerical 
improvement  suggested  by  Carlson  (1973)  (in  particular 
replacing  a  subtraction  by  a  division)  Bierman  (1975) 
obtained  the  following  set  of  equations  for  the  scalar 
measurement  update  (in  a  different  notation). 


In  the  following,  brackets  will  refer  to  a  particular 
element  in  a  matrix  or  vector,  e  g.,  /.,[fc.y)  is  the 

(A-'./Mh  element  of  the  matrix  I. ,;  the  diagonal  ele¬ 
ments  will  be  denoted  by  a  single  index,  r.g.  I)[j  ): 

Define  :  bT  -  h.T L 

Initialize :  a„  =  0  (  n  x  1  vector  ) 

bn  -  r  {scalar) 

Iterate  for  }  =n,n-l . 1 

hj-i  =  b,  +  62[j)  D\)} 

Di[]  \  =  4>  D\j]/b,.x 

0  =  b\j)D[j]/^  , 

£.[•■>]  =  £[*.;  1  -  b[]  ]  a} 

“j-1  =  “j  *■  0  £.[•;! 

Examination  of  this  algorithm  shows  that  it  takes 
0(nz)  additions  and  multiplications  and  0(n)  divisions  to 
go  from  j L.  D\  to  }£».  D.\  (see  Bierman  (1977) 
P  107). 

Andrews'  Parallel  Implementation 

With  the  advent  of  VLSI  technology,  it  was  natural  to 
ask  if  the  processing  time  could  be  reduced  by  using 
parallel  processing  arrays,  c.g.  of  the  systolic  type.  A 
systolic-type  network  (see  H.T.  Kung  (19H2))  is  charac¬ 
terized  by  the  smooth  flow  of  data  through  a  network  of 
simple  Processing  Elements  (PEs)  with  only  local  com¬ 
munication  among  the  PEs  and  with  limited  access  to 
external  data.  Such  an  architecture  is  convenient  for 
VLSI  implementation  because  of  the  regularity  and  sim¬ 
plicity  of  the  PL’s  and  the  short  path  of  communication 
links.  However,  the  algorithms  that  can  be  mapped  into 
such  an  architecture  must  be  such  that  maximum 
advantage  is  taken  of  the  data  available  at  a  given  time 
in  a  PE.  [We  may  remark  here  that  by  the  term 
systolic-type  networks  we  are  not  making  any  distinction 
as  to  whether  the  final  implementation  will  use  the  syn¬ 
chronous  systolic  arrays  of  H.T.  Kung  (1962)  or  the  asyn¬ 
chronous  Wavefront  Array  Processors  of  S.Y.  Kung 
(1982)— the  distinction  really  depends  upon  the  size  of 
the  matrix.] 

A  first  attempt  at  a  parallel  implementation  was 
made  by  Andrews  (l99l),  who  modified  the  order  of  exe¬ 
cution  of  some  of  the  equations  in  the  Bierman  algorithm 
and  gave  a  data  flow  diagram.  However,  his  scheme  can¬ 
not  be  implemented  using  simple  and  synchronized  Pro¬ 
cessing  Elements  (PEs)  as  in  a  systolic-type  network.  As 
indicated  by  Andrews'  attempt,  Bierman's  version  (1975) 
of  the  U—D  measurement  update  does  not  appear  to  be 
the  best  form  for  suggesting  a  parallel  implementation. 
We  shall  describe  in  the  next  section  how  another  look  at 
the  problem  yielded  a  new  algorithm,  with  no  explicit 
equations,  that  suggests  some  satisfactory  parallel 
implementations. 

1.2  A  NEW  AICOKITIIM  FOR  LDU  MEASUREMENT  UPDATE 

It  will  be  useful  to  illustrate  the  difficulty  that  led  to 
the  Bierman  equations  for  the  measurement  update  by 
first  showing  how  to  obtain  a  square-root  algorithm  for 
the  time-update  equation  (6), 

Pm*  Pi  if\T  *  Qt  CIi 

For  any  nonnegative-definite  matrix  A,  wc  shall 
define  a  lower  triangular  matrix  A,/*  as  its  unique  lower 
triangular  square  root  factor  if 

A  =  A''*At/*  ;  Ar/i  =  (A,/l)T 

and  (for  uniqueness)  A,/*  has  nonnegative  diagonal 
entries. 

Then  we  claim  the  following:  the  lime  update  is 


solved  by  finding  any  orthogonal  matrix  9  to  triangular- 
ize  the  array  [ FiPlf. *  C4ftl/Jl,  i.e.  such  that 

C^,/2]9  =  [Pl'l  0]  (7) 


This  claim  can  be  immediately  verified  by  "squaring" 
both  sides  of  Eqn.  (7). 

Computing  the  triangular  factors  and  P,'/ l 

may  imply  taking  arithmetic  square  roots,  which  are 
often  somewhat  more  expensive  to  compute  than  multi¬ 
plications  or  divisions,  and  therefore  are  sometimes 
avoided.  This  can  be  done  by  using  LDV  factorizations, 
Kailath  (1982)  noted  that  the  difficulty  with  the 
minus  sign  in  the  update  measurement  (Eqn.  (4a)) 

Px  i*  =  p>  -  p>HiTp.:lHxpi 

could  be  avoided  by  first  noting  that  the  right  hand  side 
above  is  the  so-called  Schur  complement  of  R,  ,  in  the 
(larger)  matrix 

R.  i  Mi  Pi 
Pi  Hi  Pi 


This  Schur  complement  arises  in  the  block  LDU  factori¬ 
zation  of  this  matrix;  it  is  easy  to  check  the  decomposi¬ 
tion 


/  0 

R.x  o 

I 

Kj.i 

M/.i  I 

o  Alt 

Hi 

1 

Kf  .i  =  Pi  Hi  R.  i 

On  the  other  hand,  we  can  also  consider  the  Schur 
complement  of  Pt  in  U,  which  will  be  (see  Eqn.  (3)) 

R.  i  -  HiPiPi  'PiHl  =  Ri 


corresponding  to  the  block  LDU  factorization 


From  these  two  block  factorizations  of  U.  we  can  con¬ 
clude  that  there  must  be  an  orthogonal  transformation 
matrix  9  such  that 


B?  HiP? 

0  P? 


P?i 

PiHlR.-J'3 


0 

P?i 


(8) 


This  is  in  fact  the  form  of  a  well-known  square  root  algo¬ 
rithm  due  to  Dyer  and  McReynolds  (1969). 

As  mentioned  before,  the  transformation  8  in  such 
algorithms  may  involve  the  use  of  scalar  square  roots, 
and  in  some  situations  it  is  useful  to  be  able  to  avoid 
them.  This  can  be  attempted  by  using  the  LDU  decompo¬ 
sitions 


R  =  LkDrL$.  Pi  =  LDLt  .  Pi u  *  L.Dill 


and  rewriting  the  above  expression  as  (dropping  all 
time-index  subscripts  for  convenience) 

It/* 


Lr 

ML 

0 

L 

Off  0 
0  D 
£r.  0 

tf/T'R,  £» 


9 

Hr.  0 

0  D. 


i/s 


(9a) 


or 


iff 

0 


L  n»  0 

Kj  Lea  L. 


(9b) 


Kailath  (1982)  showed  how  to  use  an  algorithm  due  to 
Gentleman  (1973)  to  find  9/j  as  a  product  of  elementary 
matrices  none  of  which  contains  any  scalar  square  roots. 
In  the  case  of  scalar  measurements  (pal),  writing  out 


the  stepSexplicitly  (which  is  not  necessary  in  the  array 
methods)  will  lead  exactly  to  the  U  - D  equations  of 
Dierman's  (1975),  which  were  obtained  in  the  quite 
different  way  described  before. 


The  Givens  and  Modified  Givens  Methods 

The  Gi.ens  method  of  matrix  triarigularization  is  to 
note  that  we  can  readily  find  an  elementary  orthogonal 
transformation  to  rotate  any  given  1x2  vector  [  p,  p-  ] 
to  make  it  lie  along  the  first  coordinate  axis.  In  fact, 
note  that 


[Pi  Pal 


Pi 

Vp?+p2f 


~Pa 

Vp?  +p7 


-Pa 

Vpf +p| 

Pi 

vp?  +p| 


[Vpp+p/  0] 


By  systematically  applying  a  sequence  of  such  elemen¬ 
tary  transformations  we  can  triangularize  any  given 
matrix,  as  any  uncertain  reader  can  check  on  a  simple 
example. 

The  issue  is  how  to  avoid  the  (scalar)  square  roots  m 
the  transformation.  Gentleman  (1973)  showed  that  this 
could  be  done  by  introducing  weighted  norms:  rotate 
[Pi  Pa  1  l-0  he  along  [  1  0  J.  keeping  equality  of  the 
weighted  norms 


[pi  Pal 


• ! 

p, 

0  dpe 

Pa 

[  1  0] 

0 

0 

dqz 

M 

0 


(10) 


In  most  problems  \dpX.  dp2\  will  be  given  (as  for  exam¬ 
ple  in  (9))  and  dq2\  have  to  be  determined.  It  is 

not  hard  to  see  what  to  do.  We  need  to  find  an  orthogo¬ 
nal  matrix  9  such  that 
1 
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or  equivalently  to  find  Q0  such  that 

[Pi  Pal  =  [1  0  1 

where 


9d  = 


c*  ♦  s*  s  1 

Now  note  that  from  (10)  is  determined  as 

<*,.  =  P?^,i  +Pl<<p  a 
.Vext  it  is  easy  to  check  that 

1 

Pidpi/  d,,  -p  2 
Pzdpt/dq,  pi 


d? i 

c  -s 

dp?  0 

s  c 

o  dq? 

fPi  Pal 


=  [10 


(Ha) 

(lib) 

(11c) 

(12) 

(13) 


So  we  have  a  candidate  for  Q0  and  the  only  issue  is  if 
(cf.  (11b)) 
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This  constraint  can  be  met  by  choosing 


c*  +  sJ  =  1 


4.  •  ^ 


(14) 


To  apply  the  transformation  in  Eqn.  (13)  to  an  arbi¬ 
trary  vector  [p pi ]. 


[pi  Pal 


Pidpt/d,,  ~Pi 

p2dp2/ dti  Pi 


=  [?  1  9s  1 


(15) 


will  require  four  multiplications  and  two  additions.  In 
many  applications,  including  ours,  it  happens  thatp!  =  1 
and  for  this  case  Golub  (see  Gentleman  (1973))  noted 
that  one  could  manage  with  two  multiplies  and  two  adds. 
To  see  this  note  that  we  can  write 


9z  =  “P2Pi  +  P2 

dp  i  .  dpz  _ , 

9 1  =  ?— Pi  +P 2  T~~Pz 

dt,  i 

=  ^~P  ,  *  Pi  +  PiPJ 


(16a) 
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dp, 
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Since  p2d_2/  d, ,  is  given  as  an  element  of  the  matrix 
(15).  we  see  that  (16a)  and  (16b)  each  requires  only  one 
multiply  and  one  add. 

This  is  the  Modified  Givens  transformation  intro¬ 
duced  by  Gentleman  ( 1973);  because  scalar  square  roots 
are  avoided,  the  use  of  the  modified  Givens  transforma¬ 
tion  generally  yields  a  speedup  in  computation,  leading 
to  the  name  fast  Givens  transformation. 

An  Example 

The  following  numerical  example  shows  how 
modified  Givens  transformations  can  be  used  to  zero  out 
one  row  of  elements  in  a  matrix  with  the  same  structure 
as  in  the  measurement  update  algorithm  presented  in 
this  paper  (Eqn.  (9)).  with  H  a  row  vector  (scalar  meas¬ 
urement  update)  and  Lp.Dg.Lgp,  and  Or,  scalars.  To  tri- 
angularize  the  matrix  we  will  zero  out  all  but  the  first 
element  of  the  first  row.  and  to  avoid  fill-in  we  will  start 
zeroing  out  the  element  (1.4)  [using  the  first  and  forth 
columns].  Then  we  will  zero  out  the  element  (1.3)  [using 
the  first  and  third  columns]:  and  so  on.  Let 

=>  0p(l.  n  +  1)  9p(l.  n)  ®p(l.2) 

where  the  elementary  rotation  places  a  zero  in 

the  (i.  ))  element  of  the  matrix,  and  n  is  the  dimension 
of  the  state  vector  (in  this  numerical  example  n  =  3). 
We  write  in  the  diagonal  matrices  for  convenience  of 
reference: 
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In  the  next  section  we  describe  a  parallel  architec¬ 
ture  for  the  Kalman  filter  measurement  update  with  the 
same  structure  as  this  numerical  example. 

1.3  ARCHITECTURE  TOR  PARALLFJ.  MEASUREMENT 
UPDATE 

It  might  seem  that  our  problem  could  be  solved  by 
applying  existing  architectures  for  triangul.iri/mg 
matrices.  Gentleman  and  Kung  (1981)  presented  a  tri¬ 
angular  systolic  array  for  this  purpose;  so  did  Ahmed. 
Delosme,  and  Morf  (1982).  who  used  a  structure  based  on 
C0RD1CS  as  the  basic  Processing  Element.  However,  our 
problem  has  special  structure  and  requires  to  do  addi¬ 
tional  computations  (like  determining  the  product 
hT  L).  Vie  shall  combine  all  these  special  requirements 
in  an  architecture  that  (i)  solves  the  problem  with  a 
linear  number  of  processors  (as  opposed  to  0(n2)  in  gen¬ 
eral),  (ii)  computes  the  error  covariance  and  slate 
updates  at  the  same  time,  and  (iii)  avoids  a  significant 
bottleneck  in  many  systolic  networks-namely.  that  the 
Processing  Elements  (usually  on  the  boundary)  that  have 
to  do  more  complex  calculations  slow  down  the 
throughput  rate  for  the  whole  system. 

For  simplicity,  we  start  with  the  scalar  measure¬ 
ment  update  (so 

tf  =  Ar.  L*=l.  Dr=t.  £r.=  1.  Dr,  -  r. ).  In  the 
next  section  we  extend  this  architecture  to  the  vector 
case.  The  architecture  we  propose  is  depicted  in  Figure 
1,  where  for  simplicity  of  illustration,  and  without  loss  of 
generality,  we  have  taken  the  number  of  states,  n.  to 
be  4. 

From  Eqn.  (9)  we  see  that  we  have  to  compute  the 
product  hT  L  =  bT  and  then  zero  each  of  the  com¬ 
ponents  of  the  vector  6 .  Each  lime  we  input  a  set  of 
data  to  our  architecture,  we  shall  zero  out  one  element 
of  the  vector  6.  using  the  transformation  0p(l,  >  +  l).  as 
in  the  example  of  the  previous  section. 

Eqn.  (9)  suggests  that  the  architecture  should  have 
four  parts.  First,  a  column  of  elementary  processors  at 
the  left  whose  function  is  to  compute  the  inner  product 
hT  ■  L  -  bT.  Second,  a  processor  at  the  top  of  Figure  1 
whose  objective  is  to  compute  the  parameters  for  the 
elementary  transformation.  9p{l.j  +  l).  Third,  a  column 
of  processors  (at  the  right  of  the  figure)  to  apply  the 
transformation.  Finally,  a  delay  network  that  will  delay 
the  values  of  the  matrix  /,  until  the  appropriate 
moment  to  apply  the  transformation. 

A  more  detailed  description  of  the  different  Process¬ 
ing  Elements  (PEs)  involved  in  our  architecture  follows. 
The  column  of  processors  that  perform  the  product 
Ar  L  is  composed  of  n  PEs.  One  of  these  PEs  is  dep¬ 
icted  in  Figure  2;  with  b[  ],  ].  and  /,[  .  )  representing 

any  element  of  the  vectors  b,  ft  and  of  the  matrix  i. 
Every  PE  in  this  column  of  processors  has  a  memory  cell 
to  hold  one  floating  point  number  (orio  element  of  the 
vector  A)  which  can  be  changed  as  explained  at  the  end 
of  Section  V.  The  function  of  each  of  these  PEs  is  to 
compute  the  product  of  the  real  number  at  one  of  the 
inputs  with  the  value  stored  in  memorv  and  add  the 
result  to  the  other  input. 
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We  define  Processing  Element  Time  (PET)  as  the 
total  time  it  takes  a  given  PE  to  compute  the  output- 
including  the  time  to  transfer  the  data  to  a  neighboring 
processor  (with  a  protocol  if  the  system  is  asynchro¬ 
nous).  Note  that  the  PET  is  independent  of  whether  the 
individual  PE  is  internally  pipelinable  (which  will  affect 
the  throughput  rate  for  the  PE).  In  the  literature  on 
systolic-type  arrays,  the  PET  is  usually  called  the  clock 
(which  can  be  misleading),  and  it  is  usually  assumed  to 
be  1.  In  systolic-type  arrays  the  PEs  with  the  largest  PET 
will  determine  when  data  will  be  available  to  neighboring 
processors;  therefore  these  processors  will  lower  the 
throughput  rate  of  the  whole  array.  In  our 
architecture— see  Eqns.  (l?)-(19)-the  largest  PET 
corresponds  to  the  processor  that  computes  the  param¬ 
eters  for  the  elementary  transformation  0/>(l.  j  +  l). 

]  =  n.  n  — 1,  •  •  ■  1;  by  dividing  this  processor  into  four 
stages  (other  numbers  can  be  chosen  if  required  by 
hardware  considerations)  and  by  adding  some  appropri¬ 
ate  delay  banks,  we  make  the  whole  architecture  work 
with  the  PET  corresponding  to  the  simplest  PE  (viz.  one 
of  the  processors  in  the  left  column  of  Figure  1).  From 
now  on  we  will  refer  to  this  PET  as  the  architecture  PET. 
namely,  the  time  it  takes  the  whole  architecture  to 
accept  another  set  of  data;  this  is  the  time  required  to 
perform  a  multiplication  of  two  real  numbers,  plus  one 
addition,  plus  the  time  involved  in  transferring  data  to  a 
neighboring  processor  (see  Figure  2).  Note  that,  in  gen¬ 
eral.  the  PET  is  greater  than  the  clock  period  of  the 
hardware. 

The  processor  that  computes  the  parameters  of  the 
transformation  is  composed  of  of  four  PEs  (see  Figure  3): 
its  structure  allows  us  to  pipeline  the  computation  of 
parameters  for  different  transformations.  We  can  divide  j 
the  computation  of  0p(l.  y  +  1)  into  more  than  four  ! 
stages  to  make  sure  that  every  single  stage  will  take  only  | 
the  architecture  PET  to  compute.  We  assume  four 
stages;  then  it  takes  4  PETs  to  compute  the  transforma¬ 
tion  parameters.  Note  that  only  one  of  the  four  PEs 
requires  a  memory  cell. 

The  transformation  0c(l.  >  +  l)  is  applied,  in  the 
form  shown  in  Eqn.  (16).  by  a  column  of  n  processors. 

A  PE  of  this  group  is  depicted  in  Figure  4;  each  of  these 
PEs  can  be  thought  as  being  composed  of  three  of  the 
PEs  of  Figure  2.  Finally,  the  delay  is  achieved  by 
nx(n  +  a  -  1)  registers,  where  usually  a  =  5-as  dis¬ 
cussed  at  the  end  of  this  section. 

Description  of  the  Input 

The  inputs  to  our  architecture  are  as  follows  (see 
Figure  1):  n  inputs  to  the  column  of  processors  at  the 
left,  plus  one  input  to  initialize  the  h  processors,  plus 
one  input  to  the  processor  at  the  top  of  the  architecture. 
We  use  the  following  notation;  we  represent  the  signals  at 
a  given  point  of  the  architecture  by  a  row  vector  whose 
elements  are  the  numbers  appearing  at  that  point  of  the 
architecture  at  different  instants  of  time  (usually  every 
PET);  the  most  recent  values  are  in  the  leftmost  column. 
Therefore  we  can  talk  about  input  matrices,  comprised 
of  several  rows  containing  the  input  to  several  proces¬ 
sors  at  consecutive  PETs.  In  this  matrix,  each  column 
will  show  all  the  inputs  to  a  defined  set  of  processors  at  a 
given  instant  of  time;  the  next  column  shows  the  input 
after  one  PET.  Similarly  for  output  matrices  [Note  that 
these  systems  can  be  asynchronous;  in  a  row  of  the 
matrix,  data  in  successive  columns  merely  show  the  fact 
that  one  datum  succeeds  the  other,  but  not  necessarily 
after  exactly  one  PET.] 

For  the  column  of  processors  that  computes  the 
product  hT  L  we  have  to  input  the  elements  of  the 
matrix  £  in  Eqn.  (9)  (for  n  -  4) 
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We  shall  read  this  data  in  by  diagonals  starting  at 
the  upper  left  corner;  so  that  the  input  matrix  is 
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In  each  time  interval.  PET,  we  shall  process  one  column 
of  the  matrix  /„.  starting  from  the  one  at  the  right. 

The  first  h  processor  requires  an  initializing  datum 
at  each  PET,  for  which  we  shall  enter  a  row  of  zeros: 

/,  =  [o  0  0  oj 


The  input  to  the  top  processor  (see  Figure  3)  will  be 
based  on  the  weighting  matrix  in  the  left  hand  side  of 
Eqn.  (9),  viz. 
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arranged  as  follows  (recall  that  we  do  not  need  to  take 
the  square  root  of  this  matrix,  see  Section  111) 

Id  =  [fl[t]  D[ 2]  /J[3)  0[4|  •  •  •] 

where  the  asterisks  stand  for  dont  care  inputs.  The 
matrices  /,  and  /D  contain  all  the  information  we  need 
for  processing  one  measurement. 


Description  of  the  Output 


The  output  from  the  column  of  processors  at  the 
right  can  be  described  by  a  matrix.  0„.  which  w,n  be 
delayed  (n-nx4-l)pETs  with  respect  to  /„  (i.e. 

(n+a-l)PETs  due  to  delay  network  plus  2  PETs  due  to 
the  processing  by  the  right  and  left  columns  of  proces¬ 
sors). 
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The  output  from  the  top  processor  is  given  by  On 
which  will  be  delayed  3  PET  with  respect  to  /0  (i.e.  the 
time  it  takes  to  propagate  through  three  stages  of  the 
top  processor): 

00  =[£>.[  1]  £>.[2]  0,[3]  D,[ 4]  •  •  •] 

This  completes  the  description  of  the  architecture 
for  updating  the  covariance  factors  (Eqns.  (2)-(4))  of  the 
measurement  update. 


Updating  the  Slate  Estimate 


An  important  advantage  of  our  architecture  is  that 
the  same  structure  allows  us  to  compute  the  updated 
measurement— viz,  Eqn.  (1);  this  fact  docs  not  follow 
from  Eqn.  (9),  but  from  an  understanding  of  the  archi¬ 
tecture.  It  is  easy  to  show  that  if  we  include  the  stale 
vector  x  in  the  input  matrix  /,  as  follows 
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and  also  initialize  the  h  processors  to  include  I  he  scalar 
measurement  y  as  follows 

/,  =  [0  0  0  -y  0] 


t  Square  bracketi  refer  tv  an  element  of  a  vector  or  x. a t r , * 
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then  the  output  will  have  the  same  structure,  and  will 
include  the  updated  state  vector  x  * 
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The  input  of  the  diagonal  elements  and  the  correspond¬ 
ing  output  remain  unchanged. 

Initialization 

For  the  algorithm  to  work  properly  we  need  to  ini¬ 
tialize  the  memory  cells  in  the  right  column  of  proces¬ 
sors  with  the  value  zero.  This  initialization  can  be  done 
using  the  following  input  matrix  before  entering  1. 

fo  0  ol 


1C. 


0  0 

0  • 


The  structure  of  the  matrix  1C.  will  be  preserved  at 
the  output.  Additionally,  we  need  to  enter  a  string  of 
n-1  zeros  to  initialize  the  first  h  processor 

/C,  =  [  0  0  0  J 

Implementation  Issues 

This  architecture  shows  potential  for  a  practical 
implementation:  each  processor  can  be  implemented 
with  standard  chips  now  being  developed  (see  Fisher  et 
al.  (1983)  and  S.Y.  Rung  (1984)).  Furthermore,  the  flow 
of  information  follows  a  very  simple  and  regular  path. 
However  other  structures  are  also  possible,  including  for 
example  the  use  of  CORDIC  modules  (see  Section  2 
below). 

1.*  EXTENSION  TO  VECTOR  CASE 

As  shown  in  Equation  (9).  in  the  vector  case  we  have 
to  zero  the  pun  matrix  HL  using  as  a  reference  the 
lower  triangular  matrix  Lg.  Each  Givens  transformation 
will  zero  out  one  element;  however,  in  order  to  preserve 
the  elements  already  zeroed  we  must  proceed  in  a  given 
order  that  must  fulfill  the  following  requirements: 

1.  All  the  elements  above  the  one  we  want  to  zero  (in 
the  same  column )  must  be  zeroed  first.  This 
requirement  comes  by  our  choosing  as  a  reference 
the  tower  triangular  matrix  Lg. 

2.  All  the  elements  to  the  right  of  the  one  we  want  to 
zero  (in  the  same  raw)  must  be  zeroed  first.  This 
requirement  comes  by  choosing  a  particular  struc¬ 
ture  for  the  L  matrix  (lower  triangular). 

One  way  to  fulfill  both  requirements  is  to  zero  the 
elements  of  the  HL  matrix  by  diagonals,  starting  in  the 
upper  right  corner,  and  in  every  diagonal  to  start  by 
zeroing  out  the  uppermost  element.  Figure  5  shows  the 
extension  of  the  architecture  to  process  p  measure¬ 
ments  (with  p  =  3  and  n  =  4).  As  expected,  this  archi¬ 
tecture  requires  more  processors  that  the  scalar  one, 
but  presents  the  following  additional  advantages: 

•  The  processors  and  structure  are  the  same  as  for 
the  scalar  update. 

•  It  takes  time  0(max\n,  p|)  (see  below),  as  opposed 
to  O(np)  for  processing  all  the  measurements  using 
the  scalar  architecture. 

•  Processing  the  measurements  one  at  a  time 
requires  the  additional  work  of  uncorrelating  the 
measurements:  using  the  architecture  for  the  vec¬ 
tor  case,  there  is  no  need  for  uncorrelating  the 
measurements. 

Most  of  the  considerations  discussed  in  Section  IV 
for  the  scalar  measurement  architecture  apply  directly 


here,  such  as  the  the  w  jy  the  coefficients  of  the  I "  matrix 
are  introduced  m  the  memory  cells.  The  input  and  out¬ 
put  matrices  !g  and  O0  are  exactly  the  same  .ts  for  the 
scalar  case.  The  irnt Mli/atiori  matrix  will  be  also  the 
i  same,  fCa.  The  vector  of  measurements,  y  .  (p  =  d)  is 

input  as 

;  /,  =  [  0  0  0  -vl  1 1  0 1 

/2  =  [  0  0  0  -y(2l  0  | 

/3  =  [  0  0  0  — y[.>]  0  ] 

To  initialize  the  top  processor  in  the  columns  of 
processors  which  perform  the  H  i  product  we  will  use 

IC,  -  1  0  0  0  ] 

tc2  =  [  0  0  0  •  ] 

1C3  -  [  0  0  0  •  •  ] 

where  the  asterisks  show  don  t  core  values;  here  they  are 
used  to  indicate  proper  tuning.  There  is  much  more  to 
be  said  about  timing,  operation  counts  and  other  com¬ 
plexity  aspects,  for  which  wc  must  refer  to  the  full 
paper. 

2.  Some  Algorithms  for  Tocplitz  Matrices 


The  material  in  this  section  is  based  on  part  of  a 
paper  to  appear  in  Modem  Signal  /Processing  and  VI  SI. 
S.Y.  Rung,  K.  Whitchouse  and  T.  KaiUth  (eds),  prentice- 
Hall.  1984. 

The  by  now  well-known  linear  predictive  coding 
(LPC)  methods  for  speech  analysis  arc  based  on  the 
hypothesis  that  speech  waveforms  can  be  modeled  as  the 
output  of  a  linear  time-invariant  filter  driven  white  noise. 

The  linear  time-invariant  filter  ran  be  modeled  in 
many  ways,  but  for  a  variety  of  reasons  early  attention 
was  focused  on  using  "all-pole'  models,  or  equivalently 
on  modeling  speech  as  a  stationary  autoregressive 
discrete-time  random  process: 

Vi  +  -4.V  1V1  1  +  +  Ay  vVl  V  = 

where  fey,  (  is  a  zero-rnean  white  noise  process.  The 
modeling  problem  is  to  choose  the  order  ,V,  the 
coefficients  (Ay ,(  and  the  noise-variance,  K,  y  say.  so 
as  to  best  fit  the  observed  speech  signal  |y, .  f  >  0( 

The  standard  procedure  is  to  form  the  so-called 
'sample  covariance’  estimate  of  the  second-order  statis¬ 
tic, 


Ri  =  C  y,y,  ,1 


of  the  stationary  process  fy,,  t  >  0),  and  to  notice  that 
the  coefficients  }Ay,|  can  be  obtained  by  solving  ttie 
so-called  Yule-Walker  equations 


Ay, 1  /] 


ff c 

/?,  /?0 


ff.V 


/?, 

Rn  /?,  Ho 


There  is  room  for  disagreement  with  this  formulation 
However,  one  reason  it  is  popular  is  th.it  the  speei  il 
constant-along-diagonals  (Tocplitz)  nature  of  the 
coefficient  matrix  in  U)  lends  itself  to  a  convenient  f  ist 
recursive  solution  algorithm  --  the  so-called  l.evmson 
algorithm: 


A.v,  i(* ) 

z 

Av(* ) 

1 

/M*) 

4c(z)  =  1  =  Hc(z) 

where 


A»(l)  =  Ass  *  4v ,v-i*  A*.,**-1  +■  zN  , 


Bs(*)  -  the  so-called  'reverse  polynomial'  of  A.v(z) 


=  AyyZS  ♦  Ay 

•  Av.iZ  +  l  . 

and 

Ay  mR  i  +  Ay  Jy_ 

i  f?s+...+A.v,i 

iR.V  +  Rnu 

Bf 

(3a) 

=  mi  -*,$M) 

.  =  /?„ 

(3b) 

The  point  is  that  we  successively  build  up  the  solu¬ 
tions  [Asx.N  -  0.1....(.  with  the  major  computational 
burden  being  that  of  forming  the  'inner  product'  for 
**♦1.  which  requires  N  multiplications  and  N  addi¬ 
tions.  Therefore,  computing  j*  lt...,fcw j  will  require  of 
the  order  of 

1  +  2  +  •  •  •  +  N  =  N(N  +  l)/2 

or  0{Nl)  elementary  computations,  which  is  an  order  of 
magnitude  less  than  the  O(N^)  computations  required 
to  solve  a  A  arbitrary  set  of  linear  equations,  i.e..  one 
without  the  special  Toeplitz  structure  of  (l). 

The  Levinson  algorithm  can  be  used  to  produce  a 
family  of  autoregressive  models  of  increasing  order,  and 
we  have  to  decide  in  some  way  on  the  appropriate  order. 
There  are  various  tests  (e.g..  Akaike's  Information  Cri¬ 
terion)  and  other  considerations  (e.g.,  practical  design 
limits  on  integrated  circuit  implementations)  involved  in 
this  choice,  which  we  shall  not  elaborate  here.  What  we 
do  wish  to  emphasize  is  that  Levinson  algorithm  has  a 
special  structure  that  allows  us  some  flexibility  in  mak¬ 
ing  a  decision  on  the  order. 

The  traditional  way  of  implementing  a  Alter  to  com¬ 
pute  »,v.i  is  via  a  transversal  (tapped-delay-line)  filter 
with  coefficients  {A*  0,..,AwjyI,  see  Figure  6:  we  should 
remark,  to  avoid  confusion,  that  the  transfer  function  of 
the  filter  that  computes  e„ ,  is  not  Ajv(z)  but  rather 

Anm  +  +  +  z~N  =  z  ~NAf/(z )  . 


must  lie  within  the  unit  circle,  but  this  condition  does 
not  translate  easily  into  bounds  on  the  coefficients 
)A,v,|.  For  this  and  other  reasons  the  numerical  proper¬ 
ties  of  the  (normalized)  lattice  filter  tend  to  be  belter 
than  those  of  the  transversal  filter.  Therefore,  it  may 
not  be  surprising  that  Texas  Instruments  chose  to  use 
this  structure  when  building  their  very  successful  speech 
synthesis  chip  (Wiggins  and  llrantingham  (1979));  the 
modular  structure,  local  interconnections  and  rhythmic 
data  flow  all  make  for  convenient  VLSI  implementation 
(Mead  and  Conway  (1990).  H.  T.  Kung  (1979),  (1992),  S.  Y. 
Kung  et  al.  (1982)). 

COKDIC  Implementations 

Each  section  in  Figure  7(a)  requires  two  multiplica¬ 
tions  and  two  additions:  the  multiplications  tend  lo  be 
more  expensive  and.  therefore,  rearrangements  have 
been  devised  that  use  scaling  to  manage  with  only  one 
multiplication  (see  e.g..  Market  and  Cray  (1976,  See. 
5.5)).  For  numerical  reasons,  one  often  goes  to  a  section 
with  four  multipliers,  as  in  Figure  7(b).  Somewhat  irom- 
catty,  it  turns  out  that  this  section  can  in  fact  be  imple¬ 
mented  without  using  any  explicit  multiplications  at  all. 
by  using  the  so-called  COKDIC  (Coordinate  Rotation  Digi¬ 
tal  Computer)  implementations,  based  on  the  properly 
shown  in  (5)  below. 

Such  circuits  have  been  proposed  and  already  used 
in  several  applications  (e  g.,  by  Despain  (1974).  (1979). 
Haviland  and  Tuszynski  (19U0).  see  also  Schclin  (1983)): 
in  recent  years  they  have  been  suggested  for  a  variety  of 
one-  and  two-dimensional  VLSI  computing  structures  of 
the  systolic  and  wavefront  array  types  —  see.  e.g..  the 
papers  of  Ahmed.  Delosme  and  Vforf  (19B2),  Rao  and 
Kailath  (19B3a,b),  Dewilde  et  al.  (1994),  and  also  the 
thesis  of  Ahmed  (19B2)  and  Delosme  ( 19B2). 

Parallel  Computation  and  the  Schur  Algorithm 


However,  if  we  were  uncertain  about  our  choice  of  order 
and  wished  to  compute  say  e,v,gj.  then  we  would  have 
not  only  a  longer  transversal  filter  but  we  would  have  to 

reset  all  the  tap-gain  coefficients  from  jA*.o . A*.*!  to 

Mw*z.o . There  are  many  examples  in  which  it 

is  desirable  to  compute  the  filter  response  over  a  range 
of  values  of  N  before  deciding  on  a  fixed  order,  and  this 
is  not  convenient  with  transversal  filter  implementations. 


A  natural  question  with  VLSI  technology  is  to  explore 
the  question  of  how  much  the  determination  of  the 
reflection  coefficients  can  be  speeded  up  by  using  paral¬ 
lel  computation  —  say  with  .V  processors  working 
together.  The  hope  is  that  if  a  processor  takes  one  unit 
of  time  for  each  elementary  computation,  then  we  would 
hope  to  use  N  processors  to  obtain  the  answer  in  time 
0(N)  as  compared  to  0(/V2)  with  a  single  serial  proces- 


Now  the  Levinson  recursion  (2)  shows  that  the 

Jfcy  t  s  1 . JV|  along,  with  the  fact  that 

Ao(* )  = /?o(z )  =  1.  provide  an  alternative  parametriza- 
tion  of  the  filter:  knowing  these  values  allows  us  to  con¬ 
struct  Am(z).  and  hence  the  coefficients  j  Ayv,  i  •  from 
(2).  If  we  wish  to  go  to  a  different  order,  say  N  +  2,  in 
this  parametrization  we  need  only  to  add  two  more 
coefficients  without  having  to  change  any  of 

the  earlier  values.  This  invariance  property  of  the  jit, (- 
parametrization  can  be  exploited  by  using  a  different 
implementation  of  the  filter  in  terms  of  the 
Jfci.  l<t«.V)  rather  than  the  jAjv.ij.  Examination  of 
(2)  suggests  that  we  can  build  the  filter  as  a  cascade  of 
lattice  sections",  as  in  Figure  7(a),  or  in  a  certain  nor¬ 
malized  (see  Eq.  (5)  below)  form  as  in  Figure  7(b). 

It  is  easy  to  calculate  the  inverse  of  the  lattice 
filters  in  Figure  7  by  using  signal  flow  graph  rules  —  the 
result  of  doing  this  for  Figure  7(b)  is  shown  in  Figure  8 
and  provides  the  "modeling"  filter.  It  has  the  form  of  a 
discrete  transmission  line,  and  helps  explain  why  the 
are  often  called  rtfltetion  coejficipnts. 

This  physical  interpretation  suggests  that  we  must 

have 

1*4  I  »  1  (4) 

•  f®ct  that  also  follows  from  (3b)  (since  the  variances 
Inf  I  must  be  nonrregative).  There  is  a  corresponding 
constraint  on  A*(* ):  the  roots  of  the  polynomial  z  vA(z  ) 


However,  it  is  not  hard  to  see  that  a  parallel  imple¬ 
mentation  of  the  Levinson  algorithm  will  take  time 
0(N  log  .V)  because  of  the  inner  product  operation 
needed  to  form  the  jfc,(  --  N  processors  can  carry  out 
the  N  multiplications  needed  to  form  in  parallel 

in  one  time  unit,  but  the  additions  required  will  take  at 
least  log  JV  steps. 

This  seems  disappointing,  but  there  is  hope.  It 
turns  out  that  there  is  a  mathematically  equivalent  algo¬ 
rithm,  traceable  to  Schur  (1917),  that  avoids  the  inner 
product  step  in  forming  the  j t,(  and  thus  will  allow 
implementation  in  time  O(N)  with  /V  processors. 

The  Schur  algorithm  was  originally  presented  in 
1917  as  a  test  to  see  if  a  power  series  was  analytic  and 
bounded  in  the  unit  disc,  in  our  problem  it  reduces  lo  a 
simple  three  step  procedure  (see  Dewilde,  Vieira  and 
Kailath  (197H)  and  l.ev-Ari  and  Kailath  (1984))  for  the 
computation  of  the  )fc,(  associated  with  a  covariance 
sequence  j/f0./f, . ffvj. 

Step  1:  Start  with  a  generator  malnx  (superscript 
*  denotes  matrix  transpose) 


and  shift  the  first  column  down  one  row  to  get 
0  /?c  fl.v  ij 

g;  = 

0  /?JV 


Step  2:  Compute  k  ,  as  the  ratio  of  the  (2,2)  and 
(2,1)  entries  of  Gt. 


Step  3:  Form  a  matrix 

1  [  1  -*> 
e(*,)=  . 


Vi-fcf  -t,  t 


and  apply  it  to  Gt  to  obtain  a  new  Schur-reduced  gen¬ 
erator  of  the  form 


c;  =  0(fc,)c;  = 


XX...  zl 

Ox...  zl 


where  the/Z  denotes  elements  whose  exact  value  is  not 
relevant  at  the  moment;  9  is  known  as  a  J  -rotation 
matrix  because 

ll  0  1 

9 (k)  /  0'(fc)  =  /  .  J  = 


Its  role  is  to  rotate  in  the  /-metric  (i.e...  to  hyperboi  •) 
the  2nd  row  of  G  to  lie  along  the  first  coordinate  direc¬ 
tion. 

Now  we  can  repeat  Steps  1,  2,  3  to  obtain  k2  and  a 
new  reduced  generator  matrix  Gg.  And  so  on. 

It  can  be  shown  that  the  computed  in  the 

Schur  algorithm  are  exactly  the  same  quantities  as 
defined  before  in  the  Levinson  algorithm.  However,  note 
that  the  Schur  algorithm  only  requires  sets  of  2  x  1  row 
vector  by  2x2  matrix  multiplications,  which  can  be 
carried  out  in  parallel  at  each  stage.  Therefore,  we  will 
need  only  0(N )  time  units  to  carry  out  the  Schur  algo¬ 
rithm  with  N  processors,  as  compared  to  0(\  log  A’) 
for  the  Levinson  algorithm. 

In  fact,  a  parallel  and  pipelined  lattice  VLSI  comput¬ 
ing  structure  based  on  the  Schur  algorithm  has  already 
been  designed  and  built  (see  Kung  and  Hu  (19B3). 

Cholesky  Factorization 

The  Schur  algorithm  is  also  closely  connected  with 
the  so-called  Cholesky  factorization  of  the  Toepiitz 
covariance  matrix  R  in  (1).  Such  factorizations  are 
important  in  many  calculations  involving  the 
corresponding  stochastic  process.  The  Cholesky  factor, 
say  C.  is  the  unique  lower  triangular  matrix  with  posi¬ 
tive  diagonal  entries  such  that 

R  =  CC' 

It  turns  out  that  C  is  immediately  determined  by  the 
Schur  algorithm; 

the  i-th  column  of  C  =  the  first  column  of  G,  ,  (8) 

where  the  [GiJ  are  the  reduced  generator  matrices 
obtained  in  the  Schur  algorithm.  Thus  we  have  a  fast 
algorithm,  using  0(A*)  computations,  for  Cholesky  fac¬ 
torization  of  a  Toepiitz  matrix,  as  compared  to  0(.Y3)  in 
general,  (see  Bareiss  (1969).  Morf  (1970).  Rissanen 
(1973)). 

Transmission  lane  Interpretations  and  Inverse 
Scattering  Problems 


the  )G,(  means  that  the  values  at  the  input  of  the  i-lh 
delay  clement  are  the  entries  of  the  j-th  column  of  the 
Cholesky  factor  C.  Consequently,  we  c.iri  .ilso  s.iy  that 
the  values  at  the  inputs  of  the  delay  elements  at  any¬ 
time  l  are  the  entries  of  the  1-th  row  of  the  Choh  -ky 
factor  C.  [Incidentally,  this  interpretation  removes 
much  of  the  mystique  sometimes  associated  with  the  so- 
called  fast  Cholesky -by-column  arid  fast  Cholesky-by-row 
algorithms. ) 

This  transmission-line  interpretation  suggests  that 
the  computations  could  be  carried  out  by  exciting  -truc- 
turcs  as  in  Figure  9  with  light  or  other  electromagnetic 
waves,  depending  upon  the  medium  chosen  for  imple¬ 
mentation  (e  g.,  fiber  optics  or  acoustic  w  ives  in  solids, 
etc.). 

In  fact,  we  might  mention  that  starting  with  the 
graphical  representations  in  Figure  9,  and  using  some 
fundamental  wave  propagation  and  energy  conservation 
concepts,  we  can  obtain  simple  proofs  of  all  the  above 
results  and  in  the  process  develop  intimate  and  useful 
connections  between  transmission  tine  theory.  Cholesky 
and  inverse  Cholesky  factorization,  inverse  scattering 
theory  and  the  Schur  and  dual  Schur  algorithms  (sec 
Bruckstein  and  Kailath  (19ti'l)).  This  transmission  line 
picture  was  also  helpful  in  obtaining  certain  orthogonal 
cascade  implementations  of  *■ ' i V A  tpole-ar.d-/ero)  filters 
with  excellent  numerical  properties  (Rao  and  Kailath 
(19  id))  and  excellent  potential  for  VLSI  implementation 
On  the  other  hand,  we  can  approach  the  results 
from  quite  another  direction  by*  noting  that  Cholesky  fac¬ 
torization  is  actually  a  way  of  testing  for  positive 
definiteness  of  a  matrix;  if  we  now  ask  what  matrix  struc¬ 
tures  lend  themselves  to  easy  testing  in  this  way,  we 
shall  be  led  to  a  class  of  matrices  (with  "displacement'' 
structure)  Df  which  the  Toepiitz  family  is  only  one  special 
case  (Lev-Ari  and  Kailath  (laftt)).  By  this  route,  all  the 
previously  mentioned  results  on  parallel  compulation 
and  transmission  line  interpretations  and  implementa¬ 
tions  can  be  carried  over  to  a  much  larger  class  of  prob¬ 
lems  as  we  shall  briefly  review  in  the  next  section. 

3.  Minimal  Realizations  and  Decoding  of  BCll  Codes 

Consider  a  discrete-time  system  with  transfer  function 

H{z)  =  t  M‘* 

i 

so  that  the  [A*)  define  the  impulse  response  of  the  system. 
Suppose,  to  take  a  very  simple  case,  that  we  know  the  sys¬ 
tem  is  finite  dimensional  of  order  n,  and  that  we  wish  to 
find  polynomials  [a(z),6(z)(  such  that 

H(z)  =  b  (z)/  a(z)  , 


a(z)  -  z" +a,zn'l-t ■•.  +  Or. 

6(z)  =  b  i®""1  +  +  bn 

Then  it  is  easy  to  check  that  we  have  the  relation 


hi  1  M  I'M 


O . 


hill  t) 


A  good  indication  of  a  natural  connection  between 
the  Schur  algorithm  and  potential  VLSI  or  other  imple¬ 
mentation  methods  can  be  obtained  from  the  graphical 
representations  of  ihe  Schur  algorithm  shown  in  Figure 
9.  Then  the  above  result  on  the  relation  between  C  and 


From  ti.is  set  of  equations,  it  is  clear  that  b(z)  may  be 
easily  found  via  back  substitution  once  a(z)  is  known 
Thus,  the  main  work  involved  is  to  find  a(z)  by  solving  the 
Hankel  matrix  equation. 


On 

0 

• 

Ol 

= 

1 

0 

General  methods  of  solving  n  linear  equations  in  n  unk¬ 
nowns  require  0(n3)  elementary  computations.  However, 
here  the  special  Hankel  structure  can  be  exploited  to  solve 
the  equations  with  0(.V2)  computations.  This  result  was 
perhaps  first  noted  in  the  coding  theory  literature,  where 
Berlekamp  (1969)  developed  an  important  decoding  algo¬ 
rithm  for  8CK  codes,  which  essentially  involved  the  solution 
of  Hankel  linear  equations  over  Galois  field  with  elements 
}0.1|;  Massey  (1969)  later  showed  the  relation  to  minimal 
realization  problems  over  this  field.  The  Berlekamp-Massey 
algorithm  can  be  in  its  simplest  form  described  as  follows: 

a(z)  =  A <*>(*) 
where 

A!**°(z)  1  -Atn*  Ia!*’(*)  (a°(z)  fl 

=  .  =  (3) 

j p*ti<$**i  £?°(z)  1 

<5*  4-1  =  1  —  <5*  60  =  0 

\ k/$ 

a*m  =  £  a, <*>**♦,-,  (4) 

1=0 

By  "simplest  form"  we  mean  that  the  coefficient  matrix  is 
assumed  to  have  all  leading  minors  nonzero.  This  unrealis¬ 
tic  assumption  can  be  removed,  at  the  cost  of  some  further 
complexity  in  the  algorithm.  We  ignore  this  (important) 
refinement  here,  because  our  point  is  that  the  inner  pro¬ 
duct  in  (4)  prevents  effective  parallel  implementation  of  the 
Berlekamp-Massey  algorithm:  with  n  processors  we  will 
need  0(nlogn)  elementary  computations  rather  than 
0(n)  as  we  might  expect. 

An  alternative  method  of  solving  the  linear  equations 
can  be  based  on  attempting  a  triangular  factorization  of  the 
coefficient  matrix  in  the  form 


k 

|l  *  I  •  d„ 

0  1  X 

1  1 

II 

0 

k  Azn 

p  0  .1 

t  ■  ■  :  0] 

More  compactly,  we  denote  this  as 

M(n,n  +  +  l.n  +  1)  =  Q(n,n  +  1) 

The  factorization  of  M  is  generated  by  recursively  comput¬ 
ing  the  columns  of  P.  This  leads  to  recursions  of  the  form 

P*n(*)  =  (*  ~7 *)/>*(*)  “  i*p*-i(z) 

where 

*  =  "  0iQ>  fi&l  ' 

p*(z)  is  a  polynomial  formed  from  the  (fc  +  l)’*  column  of 
P  with  the  top  element  assigned  the  power  z°  and  so 
forth.  is  the  ith  element  below  the  main  diagonal  in 

the  (fc  +  1)*'  column  of  Q.  e.g.,  fl jJ0)  is  the  element  in  the 
(fc+-1)*<  column  that  is  on  the  main  diagonal.  This  ''three- 
term"  recursion  is  standard  in  the  theory  of  orthogon.il 
polynomials  and  in  fact  the  above  polynomials  are  already 
known  in  that  literature  as  Lanczos  polynomials.  Again  the 
classical  theory  has  to  be  extended  when  .t/(n,n  +  1)  can 
have  zero  leading  minors;  this  can  be  done  by  replacing  the 
term  (*  -7*)  in  (6)  by  a  polynomial  in  z  of  higher 
degree;  see  Kung  ((1977),  Ch.  4).  The  main  point  here,  how¬ 
ever,  is  that  the  coefficients  \0k.yk\  in  (6)  arise  naturally  in 
each  step  of  the  factorization,  and  no  inner  products  are 
involved.  Therefore,  as  noted  by  Todd  Citron,  the  Lanczos 


recursions  arc  better  suited  f or  par  ulci  implement ation.  In 
his  thesis  research  at  Stanford.  Citron  has-  devdoped  a  fas! 
highly  parallel  implementation  of  t tie  gi-ncr  ili/t-d  .'.aievo- 
recursions  for  solving  the  partial  realization  and  l!'.':  decod¬ 
ing  problems. 

4.  VIS1  Implementations  of  Signal  Processing  Operations 

In  the  last  few  years,  there  has  been  great  activity  in 
the  area  of  systolic  array  implementation  of  basic  signal 
processing  operations  such  as  convolution,  polynomial  mul¬ 
tiplication  and  division,  matrix  multiplication  and  l,Lr  and 
Cholcsky  matrix  decomposition  (see,  e.g..  H.  T  Kung  (1992) 
and  the  chapter  by  Kung  and  Leiserson  in  Mead  and  Conway 
(1990)).  Systolic  arrays  have  the  advantages  of  modularity, 
local  interconnectedness  and  regular  data  flow  and  there¬ 
fore  are  being  studied  as  prime  candidates  for  VLSI  imple¬ 
mentation.  However,  numerical  issues  important  for  actual 
implementation,  such  as  sensitivity  to  coefficient  truncation 
and  rounding  and  susceptibility  to  overflow  and  limit-cycle 
oscillations,  do  not  appear  to  have  been  addressed  as  yet. 
In  Rao  and  Kailath  (1993b),  it  is  shown  that  most  presently 
known  systolic  arrays  would  in  fact  suffer  from  such  prob¬ 
lems.  because  they  can  be  interpreted  as  rearrangements  of 
the  so-called  controller  and  observer  (or  direct)  canonical 
form  of  linear  system  theory;  the  rearrangements  are 
essentially  those  required  to  make  these  canonical  forms 
pipelineable.  In  linear  system  theory,  it  is  known  that 
numerical  problems  can  be  reduced  by  using  cascade  reali¬ 
zations.  and  that  in  fact  overflow  and  limit  cycle  oscillations 
can  be  completely  avoided  by  using  cascades  of  orthogonal 
sections.  The  Schur  algorithm  described  in  Section  2  can  be 
used  to  derive  two  new  classes  of  pipelined,  systolic  ARMA 
digital  filters  of  this  type.  In  the  AK  (all  pole)  case,  they  are 
exactly  the  lattice  filters  shown  in  Higs.  7  and  9  In  the 
ARMA  (pole-zero)  case,  we  have  single  input,  iwo-output 
structures  of  the  types  shown  in  Kig.  10.  these  are  derived 
by  applying  energy-conservation  constraints  to  the 
transmission-line  analysis  mentioned  at  the  end  of  Section 
2. 

There  is  a  tradeoff  between  chip  area  and  computation 
time  in  VLSI  implementation  and  several  authors  have 
developed  lower  bounds  for  the  area  in  terms  of  a  quadratic 
function  of  the  computer  rate.  Computer-science  analyses 
of  signal  processing  chips  have  not  generally  taken  into 
account  the  rationality  of  the  transfer  functions  that  may 
be  involved:  by  doing  this,  tighter  lower  bounds  have  been 
obtained  in  Rao  and  Kailath  (1993b).  It  turns  out  that  the 
observer  and  controller  canonical  forms,  and  the  parallel 
cascade  and  orthogonal  cascade  structures  mentioned 
above  all  meet  the  complexity  lower  bounds;  however,  the 
widely-used  (and  very  low  sensitivity)  wave-digital  filters  and 
most  noncanonical  state-space  realizations  do  not. 


5.  Computer-Aided  Knginccring  Analysis  and  Design 

A  less  analytical,  but  no  less  significant  development  in 
the  VLSI  era  is  the  growing  availability  of  desk  top  comput¬ 
ers  and  work  stations  with  powerful  simulation,  graphical 
and  computational  capabilities.  Dramatic  improvements 
can  be  expected  in  the  level  and  sophistication  of  advanced 
theoretical  techniques  that  can  be  brought  to  bear,  in  often 
user-transparent  and  very  user-friendly  ways,  on  practical 
problems.  Not  the  least  significant  will  be  the  development 
of  the  proper  combination  of  human  skills  and  machine 
capabilities  that  Professor  Rosenbroek  presciently  luentified 
several  years  ago  (1990  CDC  Plenary  Lecture;  also  in  >'ys- 
lems  Sc  Control  I.ettP.rs,  Vol.  1.  pp.  2-6.  July  1991)  as  an 
essential  ingredient  in  rightly  shaping  the  relationship 
between  man  and  machine,  automation  and  society. 


0.  Concluding  Remarks 

This  written  version  is  only  meant  to  provide  some 
more  background  and  detail  on  the  major  technirai  topics 
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that  I  expect  to  use,  to  varying  degrees,  in  the  actual  lec¬ 
ture.  1  am  indcbteu  to  several  colleagues  and  students, 
includirg  K.  Ahmed,  T.  Citron,  J.  Delosme,  P.  Dewilde,  N. 
Gupta,  H.  Jagadish,  J.  Jover,  S.  Y.  Kung.  D.  T.  Lee,  H.  Lev-Ari, 
M.  Morf,  S.  K.  Rao,  S.  Shah  and  H.  Whitehouse,  for  many 
patient  discussions  and  talks  that  have  helped  shape  my 
understanding  of  a  fascinating  new  set  of  problems  and 
applications.  1  am  also  grateful  to  the  conference  organiz¬ 
ers,  and  especially  S.  Marcus  and  J.  Melsa,  for  the  invitation 
to  present  this  lecture. 
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Figure  1-  Linear  f  Jctwork  for  SeaJar  Measurement  (n  =  4) 
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Figure  6.  Architecture  for  Vector  Measurement  (n  =  4,  j>  =  3) 


a)  Whitening  Filter 
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b)  Mode  ling  Filter 
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